Data-Intensive Text Processing with MapReduce by Jimmy Lin & Chris Dyer
Author:Jimmy Lin & Chris Dyer
Language: eng
Format: epub
Publisher: Morgan & Claypool Publishers
Published: 2012-07-14T16:00:00+00:00
4.5.1 BYTE-ALIGNED AND WORD-ALIGNED CODES
In most programming languages, an integer is encoded in four bytes and holds a value between 0 and 232 − 1, inclusive. We limit our discussion to unsigned integers, since d-gaps are always positive (and greater than zero). This means that 1 and 4,294,967,295 both occupy four bytes. Obviously, encoding d-gaps this way doesn’t yield any reductions in size.
A simple approach to compression is to only use as many bytes as is necessary to represent the integer. This is known as variable-length integer coding (varInt for short) and accomplished by using the high order bit of every byte as the continuation bit, which is set to one in the last byte and zero elsewhere. As a result, we have 7 bits per byte for coding the value, which means that 0 ≤ n < 27 can be expressed with 1 byte, 27 ≤ n < 214 with 2 bytes, 214 ≤ n < 221 with 3, and 221 ≤ n < 228 with 4 bytes. This scheme can be extended to code arbitrarily large integers (i.e., beyond 4 bytes). As a concrete example, the two numbers:
127, 128
would be coded as such:
1 1111111, 0 0000001 1 0000000
The above code contains two code words, the first consisting of 1 byte, and the second consisting of 2 bytes. Of course, the comma and the spaces are there only for readability. Variable-length integers are byte-aligned because the code words always fall along byte boundaries. As a result, there is never any ambiguity about where one code word ends and the next begins. However, the downside of varInt coding is that decoding involves lots of bit operations (masks, shifts). Furthermore, the continuation bit sometimes results in frequent branch mispredicts (depending on the actual distribution of d-gaps), which slows down processing.
A variant of the varInt scheme was described by Jeff Dean in a keynote talk at the WSDM 2009 conference.10 The insight is to code groups of four integers at a time. Each group begins with a prefix byte, divided into four 2-bit values that specify the byte length of each of the following integers. For example, the following prefix byte:
00,00,01,10
indicates that the following four integers are one byte, one byte, two bytes, and three bytes, respectively. Therefore, each group of four integers would consume anywhere between 5 and 17 bytes. A simple lookup table based on the prefix byte directs the decoder on how to process subsequent bytes to recover the coded integers. The advantage of this group varInt coding scheme is that values can be decoded with fewer branch mispredicts and bitwise operations. Experiments reported by Dean suggest that decoding integers with this scheme is more than twice as fast as the basic varInt scheme.
In most architectures, accessing entire machine words is more efficient than fetching all its bytes separately. Therefore, it makes sense to store postings in increments of 16-bit, 32-bit, or 64-bit machine words. Anh and Moffat [8] presented several word-aligned coding methods, one of which is called Simple-9, based on 32-bit words.
Download
This site does not store any files on its server. We only index and link to content provided by other sites. Please contact the content providers to delete copyright contents if any and email us, we'll remove relevant links or contents immediately.
Computer Vision & Pattern Recognition | Expert Systems |
Intelligence & Semantics | Machine Theory |
Natural Language Processing | Neural Networks |
Algorithms of the Intelligent Web by Haralambos Marmanis;Dmitry Babenko(8259)
Test-Driven Development with Java by Alan Mellor(6390)
Data Augmentation with Python by Duc Haba(6288)
Principles of Data Fabric by Sonia Mezzetta(6065)
Hadoop in Practice by Alex Holmes(5939)
Learn Blender Simulations the Right Way by Stephen Pearson(5926)
Microservices with Spring Boot 3 and Spring Cloud by Magnus Larsson(5813)
Jquery UI in Action : Master the concepts Of Jquery UI: A Step By Step Approach by ANMOL GOYAL(5784)
RPA Solution Architect's Handbook by Sachin Sahgal(5209)
Big Data Analysis with Python by Ivan Marin(5178)
Life 3.0: Being Human in the Age of Artificial Intelligence by Tegmark Max(5105)
The Infinite Retina by Robert Scoble Irena Cronin(4900)
Pretrain Vision and Large Language Models in Python by Emily Webber(4155)
Functional Programming in JavaScript by Mantyla Dan(4019)
The Age of Surveillance Capitalism by Shoshana Zuboff(3915)
Infrastructure as Code for Beginners by Russ McKendrick(3912)
WordPress Plugin Development Cookbook by Yannick Lefebvre(3616)
Embracing Microservices Design by Ovais Mehboob Ahmed Khan Nabil Siddiqui and Timothy Oleson(3427)
Applied Machine Learning for Healthcare and Life Sciences Using AWS by Ujjwal Ratan(3402)
